Backtesting overfitting

Theory and Practice

Barry Quinn

In Theory

Outline

  • Backtesting and selection bias under multiple testing
  • Precision and recall in statistical testing
  • Neyman-Pearson framework: Type I and Type II errors
  • False discovery rate in multiple hypothesis testing
  • Measuring overfitting: The Deflated Sharpe Ratio
  • Practical implementation & risk management implications

Experimental evidence using simulation

  • So far we have used experimental evidence extensively.
  • More precisely, we have used Monte Carlo simulations to reach conclusions regarding the mathematical properties of various estimators and algorithms under controlled conditions.
  • Good financial research requires the ability to control the conditions of an experiment so that realistic causal inference statements can be made.

What is a backtest?

  • A backtest is a historical simulation of how an investment strategy would have performed in the past.
  • It is not a controlled experiment, because we cannot change the environmental variables to derive a new historical time series on which to perform an independent backtest.
  • As a result, backtests cannot help us derive the precise cause–effect mechanisms that make a strategy successful.
  • This identification issue is more than a technical inconvenience.

Key Challenge: Financial data is limited, non-stationary, and path-dependent, making proper validation fundamentally different from other fields like machine learning on image or text data.

Overfitting and statistical inflation

  • In the context of strategy development, all we have is a few (relatively short, serially correlated, multicollinear and possibly nonstationary) historical time series.
  • It is easy for a researcher to overfit a backtest, by conducting multiple historical simulations, and selecting the best performing strategy (Bailey et al. 2014).
  • When a researcher presents an overfit backtest as the outcome of a single trial, the simulated performance is inflated.

This form of statistical inflation is called selection bias under multiple testing (SBuMT)

  • SBuMT leads to false discoveries: strategies that appear profitable in backtests but fail when implemented.

Defining Backtest Overfitting: When the process of backtest optimization leads to strategies that fit the noise in historical data rather than genuine market inefficiencies.

Backtest hyperfitting

  • SBuMT is compounded when selection takes place sequentially at two levels:
  1. Each researcher runs millions of simulations and presents only the best (overfit) ones to her boss.
  2. The company then selects a few backtests among the (already overfit) backtests submitted by the researchers.
  • We may call this backtest hyperfitting, to differentiate it from backtest overfitting (which occurs at the researcher level).
  • It may take many decades to collect the future (out-of-sample) information needed to debunk a false discovery that resulted from SBuMT.
  • In this lecture we study how researchers can estimate the effect that SBuMT has on their findings.

Performance statistics

  • PnL: The total amount of dollars (or the equivalent in the currency of denomination) generated over the entirety of the backtest, including liquidation costs from the terminal position.
  • PnL from long positions: The portion of the PnL dollars that was generated exclusively by long positions.
  • Annualized rate of return: The time-weighted average annual rate of total return, including dividends, coupons, costs, etc.
  • Hit ratio: The fraction of bets that resulted in a positive PnL.
  • Average return from hits: The average return from bets that generated a profit.
  • Average return from misses: The average return from bets that generated a loss.
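As a minimal illustration (using simulated per-bet returns rather than a real backtest), the hit ratio and the average returns from hits and misses can be computed as follows:

# Minimal sketch: per-bet performance statistics from a vector of simulated bet returns
set.seed(1)
bet_returns <- rnorm(250, mean = 0.001, sd = 0.02)        # one simulated return per bet

hit_ratio         <- mean(bet_returns > 0)                # fraction of bets with positive PnL
avg_return_hits   <- mean(bet_returns[bet_returns > 0])   # average return from profitable bets
avg_return_misses <- mean(bet_returns[bet_returns <= 0])  # average return from losing bets

c(HitRatio = hit_ratio, AvgHit = avg_return_hits, AvgMiss = avg_return_misses)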

Risk statistics

  • Intuitively, a drawdown (DD) is the maximum loss suffered by an investment between two consecutive high-watermarks (HWMs).
  • The time under water (TuW) is the time elapsed between an HWM and the moment the PnL exceeds the previous maximum PnL.
  • In workshop 4 we used PortfolioAnalytics and charted the performance of our competing strategies.
  • You can see the drawdown statistics in the bottom graph.
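A minimal base-R sketch of the two definitions above, using a simulated cumulative PnL series (time under water is measured in observations rather than calendar days):

# Minimal sketch: drawdown and time under water from a cumulative PnL series
set.seed(2)
pnl    <- cumsum(rnorm(1000, mean = 0.02, sd = 1))  # simulated cumulative PnL
hwm    <- cummax(pnl)                               # running high-watermark
dd     <- hwm - pnl                                 # drawdown at each point in time
max_dd <- max(dd)                                   # maximum drawdown

# Time under water: longest run of consecutive observations below the HWM
under_water <- rle(dd > 0)
max_tuw <- max(c(0, under_water$lengths[under_water$values]))

c(MaxDrawdown = max_dd, MaxTimeUnderWater = max_tuw)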

Implementation shortfall statistics

Broker fees per turnover

  • Broker fees per turnover: These are the fees paid to the broker for turning the portfolio over, including exchange fees.

Average slippage per turnover

  • Average slippage per turnover: These are execution costs, excluding broker fees, involved in one portfolio turnover.

Dollar performance per turnover

  • Dollar performance per turnover: This is the ratio between dollar performance (including brokerage fees and slippage costs) and total portfolio turnovers.

Return on execution costs

  • Return on execution costs: This is the ratio between dollar performance (including brokerage fees and slippage costs) and total execution costs.

Efficiency statistics

Efficiency statistics provide a relative analysis of the performance of a backtest.

Annualized Sharpe ratio

  • Annualized Sharpe ratio: This is the SR value, annualized by multiplying by \(\sqrt{a}\), where \(a\) is the average number of return observations per year.

Information ratio

  • Information ratio: This is the SR equivalent of a portfolio that measures its performance relative to a benchmark.

Probabilistic Sharpe ratio

  • Probabilistic Sharpe ratio: PSR corrects SR for inflationary effects caused by non-Normal returns or track record length.

Deflated Sharpe ratio

  • Deflated Sharpe ratio: DSR corrects SR for inflationary effects caused by non-Normal returns, track record length, and selection bias under multiple testing.

Precision and Recall in Statistics

  • To understand how false discoveries affect performance in algorithmic trading and investment, we must first introduce two concepts.
  • In machine learning statistics, precision and recall are measures of task-specific accuracy, especially in classification problems.
  • In terms of investment strategies:

precision is the estimated probability that a randomly selected investment strategy from the pool of all positive backtests is a true strategy.

recall (or true positive rate) is the estimated probability that a strategy randomly selected from the pool of true strategies has a positive backtest.

The Neyman-Pearson Framework

Under the standard Neyman-Pearson [1933] hypothesis testing framework:

  • We state a null hypothesis H0, and an alternative hypothesis H1
  • We derive the distribution of a test statistic under H0 and under H1
  • We reject H0 with confidence \(1-\alpha\) in favour of H1 when we observe an event that, should H0 be true, should only occur with probability \(\alpha\)
  • This framework is the statistical analogue to a proof by contradiction argument
  • There are 4 probabilities associated with a predicted positive \(x >\tau_{\alpha}\)
  • \(Pr(x >\tau_{\alpha}|H_0)=\alpha\) is the Type I error probability (the significance level, or false positive rate)
  • \(Pr(x >\tau_{\alpha}|H_1)=1-\beta\) is the power of the test (recall, or true positive rate); \(Pr(x \leq\tau_{\alpha}|H_1)=\beta\) is the Type II error probability, or false negative rate
  • \(Pr(H_0|x>\tau_{\alpha})\) the false discovery rate (FDR)
  • \(Pr(H_1|x>\tau_{\alpha})\) the test’s precision
  • Note again that the p-value (\(\alpha\)) does not give the probability that the null hypothesis is true.

A mathematical argument (Lopez de Prado 2020)

  • Let’s say you have \(s\) investment strategies to analyze as a quant researcher.
  • Inevitably, some of these strategies are false discoveries, in the sense that their expected return is not positive.
  • Mathematically, we can denote:

\[s=s_T+s_F, \quad \text{where } s_T=\text{number of true strategies and } s_F=\text{number of false strategies}\]

  • Let \(\theta\) be the odds ratio of true strategies against false strategies, \(\theta=s_T/s_F\).

A mathematical argument (Lopez de Prado 2020)

  • In finance, where the signal-to-noise ratio is low, false strategies abound, hence \(\theta\) is expected to be low. The number of true investment strategies is:

\[s_T=s\,\frac{\theta}{1+\theta}\]

  • Likewise, the number of false investment strategies is:

\[s_F=s-s_T=s \left( 1-\frac{\theta}{1+\theta}\right)=\frac{s}{1+\theta} \]

  • Given a false positive rate \(\alpha\) (Type I error), we will obtain a number of false positives, \(FP=\alpha s_F\), and a number of true negatives, \(TN=(1-\alpha)s_F\).

A mathematical argument (Lopez de Prado 2020)

  • Denote by \(\beta\) the false negative rate (Type II error) associated with that \(\alpha\).
  • We will obtain a number of false negatives, \(FN=\beta s_T\), and a number of true positives, \(TP=(1-\beta)s_T\).
  • Thus:

\[\text{precision}=\frac{TP}{TP+FP} = \frac{(1-\beta)s_T}{(1-\beta)s_T+\alpha s_F} \\ =\frac{(1-\beta)s\frac{\theta}{1+\theta}}{(1-\beta)s\frac{\theta}{1+\theta}+\alpha s\frac{1}{1+\theta}}=\frac{(1-\beta)\theta}{(1-\beta)\theta+\alpha}\]

\[\text{recall}=\frac{TP}{(TP+FN)}=\frac{(1-\beta)s_T}{(1-\beta)s_T+\beta s_T}=1-\beta\]

A mathematical argument (Lopez de Prado 2020)

  • What this mathematical logic tells us is that, before running backtests on a strategy, researchers should gather evidence that the strategy may indeed exist.
  • The reason is, the precision of the test is a function of the odds ratio \(\theta\).
  • If the odds ratio is low, the precision will be low, even if we get a positive with high confidence (low p-value).

This illustrates the pitfall that p-values report a rather uninformative probability: it is possible for a statistical test to have high confidence (a low p-value) and low precision.

In particular, a strategy is more likely false than true if \((1-\beta)\theta < \alpha\) such that precision is less than 50%.

  • Finally, there is an important relationship between the false discovery rate (FDR) and precision.
  • Specifically,

\[FDR=\frac{FP}{FP+TP}=\frac{\alpha}{(1-\beta)\theta+\alpha}=1-\text{precision}\]

An FDR function

  • The following simple function calculates precision, recall, and the false discovery rate.
library(tibble)
fdr_anal <- function(ground_truth, alpha = 0.05, beta) {
  # ground_truth: prior probability that a randomly selected strategy is true
  theta = ground_truth / (1 - ground_truth)  # odds ratio of true to false strategies
  recall = 1 - beta                          # power of the test
  b1 = recall * theta
  precision = b1 / (b1 + alpha)
  tibble(Recall = recall, Precision = precision, FDR = 1 - precision)
}
  • Suppose that, before running backtests on a strategy, the researcher knows the ground truth that there is a 1% chance that the strategy is profitable.
  • If she sticks with the standard convention of a 5% significance level and a 20% chance of a false negative, what rate of false discoveries should she expect?
fdr_anal(0.01, beta = 0.2)
# A tibble: 1 × 3
  Recall Precision   FDR
   <dbl>     <dbl> <dbl>
1    0.8     0.139 0.861
  • For this reason alone, we should expect that most discoveries in financial econometrics are likely false.

Familywise Error Rate (FWER)

  • When Neyman and Pearson [1933] proposed this framework, they did not consider the possibility of conducting multiple tests and selecting the best outcome.

  • When a test is repeated multiple times, the combined \(\alpha\) increases.

  • Consider that we repeat for a second time a test with false positive probability \(\alpha\).

  • At each trial, the probability of not making a Type I error is \(1-\alpha\)

  • If the two trials are independent, the probability of not making a Type I error on the first and second tests is \((1-\alpha)^2\)

  • The probability of making at least one Type I error is the complementary, \(1-(1-\alpha)^2\)

  • Across a family of K independent tests, the probability of making no Type I error (the combined confidence) is \((1-\alpha)^K\)

  • FWER the probability that at least one of the positives is false, \(\alpha_K=1-(1-\alpha)^K\)

  • The Šidák correction: for a given \(K\) and target FWER \(\alpha_K\), set the per-test level to \(\alpha=1-(1-\alpha_K)^{1/K}\) (see the short R sketch below)
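A short base-R sketch of the FWER and the Šidák-corrected per-test significance level; the values of K are purely illustrative:

# Minimal sketch: familywise error rate and the Sidak-corrected per-test level
alpha <- 0.05
K     <- c(1, 10, 100, 1000)

fwer        <- 1 - (1 - alpha)^K      # P(at least one false positive) across K independent tests
sidak_alpha <- 1 - (1 - alpha)^(1/K)  # per-test level that keeps the FWER at alpha

round(data.frame(K = K, FWER = fwer, SidakAlpha = sidak_alpha), 4)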

FWER vs FDR

  • Thus far we have defined 2 Type 1 errors for multiple testing:
  1. Familywise Error Rate (FWER): The probability that at least one false positive takes place.
  2. False Discovery Rate (FDR): Expected value of the ratio of false positives to predicted positives.
  • In most scientific and industrial applications, FWER is considered overly punitive.
    • For example, it would be impractical to design a car model where we control for the probability that a single unit will be defective.

FWER vs FDR

  • However, in the context of finance the FDR is preferred: an investor does not typically allocate funds to all strategies with predicted positives within a family of trials, a proportion of which are likely to be false.

  • Instead, investors are only introduced to the single best strategy out of a family of thousands or even millions of alternatives

  • Investors have no ability to invest in the discarded predicted positives.

  • Following the car analogy, in finance there is effectively a single car unit produced per model, which everyone will use. If that only unit is defective, everyone will crash.

What does this all mean for quantitative finance

  • Selection bias under multiple backtesting makes it impossible to assess the probability that a strategy is false from the backtest alone.

  • Lopez de Prado (2018) argues that this explains why most quantitative investment firms fail: they are likely investing in false positives.

  • This is because most financial analysts typically assess performance on the Sharpe ratio, not precision and recall.

  • Lopez de Prado (2020) develops a framework to assess the probability that a strategy is false, using the Sharpe ratio estimate and metadata from the discovery process as inputs

The golden age of the Sharpe Ratio (1966-2012)

  • In 1966, William Sharpe proposed a ratio metric that would come to dominate investment strategy appraisal
  • Consider an investment strategy with excess returns (or risk premia) \(r_t, t=1,...,T\) which follows an IID Normal distribution

\[ r_t \sim N(\mu,\sigma)\]

  • Non-annualised SR of such a strategy is defined as

\[SR=\frac{\mu}{\sigma}\]

  • As the parameters \(\mu\) and \(\sigma\) are unknown, they must be estimated from the sample, so SR is estimated as:

\[\hat{SR}=\frac{\hat{\mu}}{\hat{\sigma}}\]

where \(\hat{\mu}\) and \(\hat{\sigma}\) are the sample mean and standard deviation of the excess returns.
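A minimal sketch of this estimator, together with the \(\sqrt{a}\) annualisation mentioned earlier, using simulated daily excess returns:

# Minimal sketch: estimating the (non-annualised) Sharpe ratio and annualising it
set.seed(3)
r <- rnorm(252 * 5, mean = 0.0004, sd = 0.01)  # five years of simulated daily excess returns

sr_hat <- mean(r) / sd(r)    # non-annualised Sharpe ratio
a      <- 252                # average number of return observations per year
sr_ann <- sr_hat * sqrt(a)   # annualised Sharpe ratio

c(SR = sr_hat, AnnualisedSR = sr_ann)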

2002 Andrew Lo and Elmar Mertens

  • Lo (2002) showed that, under the assumption that \(r_t \overset{IID}{\sim} N(\mu,\sigma)\), the asymptotic distribution of \(\hat{SR}\) is

\[(\hat{SR}-SR) \overset{a}{\to} N \left[0,\frac{1+0.5SR^2}{T}\right]\]

  • Subsequent evidence showed hedge fund returns exhibit substantial negative skewness, and positive excess kurtosis.

  • The implication is that assuming IID Normal returns will grossly underestimate the false positive probability.

  • Mertens (2002) then derived an asymptotic distribution for \(\hat{SR}\) whose variance term incorporates skewness and kurtosis.

2012 David Bailey and Marco Lopez de Prado

  • PSR estimates the probability that the observed \(\hat{SR}\) exceeds a benchmark \(SR^*\):

\[\widehat{PSR}(SR^*)=Z\left[\frac{(\hat{SR}-SR^*)\sqrt{T-1}}{\sqrt{1-\hat{\gamma}_3\hat{SR}+\frac{\hat{\gamma}_4-1}{4}\hat{SR}^2}}\right]\]

  • where \(Z[.]\) is the cumulative distribution function of the standard Normal distribution, \(T\) is the number of observed returns, and \(\hat{SR}\) is the non-annualised estimate of SR, computed at the same frequency as the \(T\) observations.

Inference on the Probabilistic Sharpe Ratio

  • For a given \(SR^*\), the Probabilistic Sharpe Ratio increases with greater mean returns, lower variance of returns, longer track record (\(T\)), positively skewed returns, and thinner tails (a minimal R implementation of the PSR formula follows).
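A minimal R implementation of the PSR formula above; the helper name psr_hat is ours, and the example returns are simulated:

# Minimal sketch of the Probabilistic Sharpe Ratio (Bailey & Lopez de Prado 2012)
psr_hat <- function(returns, sr_benchmark = 0) {
  T_obs <- length(returns)
  z     <- (returns - mean(returns)) / sd(returns)
  sr    <- mean(returns) / sd(returns)   # non-annualised SR estimate
  g3    <- sum(z^3) / T_obs              # skewness
  g4    <- sum(z^4) / T_obs              # kurtosis (3 for Normal returns)
  stat  <- (sr - sr_benchmark) * sqrt(T_obs - 1) /
           sqrt(1 - g3 * sr + (g4 - 1) / 4 * sr^2)
  pnorm(stat)                            # Z[.] is the standard Normal CDF
}

# Example: PSR against a benchmark SR* of 0 for one year of simulated daily returns
set.seed(4)
psr_hat(rnorm(252, 0.0005, 0.01), sr_benchmark = 0)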

The False Strategy Theorem

  • Bailey et al. (2014) formalised the False Strategy Theorem, which expresses the SBuMT as a function of the number of trials and the variance of the Sharpe ratios.
  • In practice a researcher may carry out a large number of historical simulations (trials) and report only the best outcome (the maximum Sharpe ratio).
  • The maximum Sharpe ratio across trials is not distributed like the Sharpe ratio of a single random trial; this is what gives rise to SBuMT: when more than one trial takes place, the expected maximum Sharpe ratio is greater than the expected Sharpe ratio from a single random trial.
  • The theorem shows that, given an investment strategy with an expected Sharpe ratio of zero and non-zero variance, the expected value of the maximum Sharpe ratio is strictly positive and a function of the number of trials.

The False Strategy Theorem

  • Given a sample of IID-Gaussian Sharpe ratios \(\widehat{SR_k},k=1,..,K\) with \(\widehat{SR_k} \sim N(0,V(\widehat{SR_k}))\)

\[E(\underset{k}{\max}(\widehat{SR_k}))V(\widehat{SR_k})^{-0.5} \approx (1-\gamma)Z^{-1} \left[1-\frac{1}{K}\right]+\gamma Z^{-1}\left[1-\frac{1}{Ke}\right]\]

  • where \(Z^{-1}\) is the inverse of the standard Gaussian CDF, \(e\) is Euler’s number, and \(\gamma\) is the Euler-Mascheroni constant.

  • Corollary: Unless \(\underset{k}{\max}(\widehat{SR_k}) >> E(\underset{k}{\max}(\widehat{SR_k}))\) the discovered strategy is likely to be a false positive.

  • But \(E(\underset{k}{\max}(\widehat{SR_k}))\) is usually unknown, ergo SR is dead.

The False Strategy theorem

  • Lopez de Prado (2020) provides code implementing the theorem; an R translation follows.
  • The theorem can be used to express the magnitude of the SBuMT as the difference between the expected maximum Sharpe ratio and the expected Sharpe ratio of a false strategy from a random trial.

The False Strategy theorem in R

getExpectedMaxSR <- function(nTrials, meanSR, stdSR){
  # Expected maximum SR across nTrials independent trials, controlling for SBuMT
  # (False Strategy Theorem approximation)
  emc = 0.577215664901532860606512090082402431042159336  # Euler-Mascheroni constant
  sr0 = (1 - emc) * qnorm(p = 1 - 1/nTrials) + emc * qnorm(1 - (nTrials * exp(1))^(-1))
  sr0 = meanSR + stdSR * sr0
  return(sr0)
}

Distribution of Maximum SR

getDistMaxSR <- function(nSims, nTrials, meanSR, stdSR){
  # Monte Carlo distribution of the maximum Sharpe ratio across nTrials trials
  out <- tibble()
  for (nTrials_ in nTrials) {
    #1) Simulate nSims x nTrials_ Sharpe ratios
    set.seed(nTrials_)
    sr <- array(rnorm(nSims * nTrials_), dim = c(nSims, nTrials_))
    sr <- apply(sr, 1, scale)  # demean and scale each simulation (returns nTrials_ x nSims)
    sr <- meanSR + sr * stdSR
    #2) Store the maximum SR per simulation
    out <- out %>% bind_rows(
      tibble("Max{SR}" = apply(sr, 2, max), "nTrials" = nTrials_))
  }
  return(out)
}

Run the experiment

library(pracma)     # for logspace()
library(tidyverse)  # for %>%, tibble and bind_rows
# Create a sequence of trial counts on a log-linear scale (10 to 10,000)
nTrials <- as.integer(logspace(1, 4, 100)) %>% unique()
plot(nTrials)
sr0 = array(dim = length(nTrials))
for (i in seq_along(nTrials)) {
  sr0[i] <- getExpectedMaxSR(nTrials[i], meanSR = 0, stdSR = 1)
}
sr1 = getDistMaxSR(nSims = 1000, nTrials = nTrials, meanSR = 0, stdSR = 1)

The most important plot in quantitative finance

Inference from plot

  • The experiment compares the empirical (Monte Carlo) estimate of the maximum Sharpe ratio under the null of a false strategy to that implied by the False Strategy theorem.
  • The plot shows the output of the experiment for 10 to 10,000 trials.
  • The code sets \(V[\hat{SR_k}]=1\) and simulates the maximum Sharpe ratio 1,000 times, to derive a distribution of maximum Sharpe ratios for each \(K\) (number of trials).
  • The y-axis shows the distribution of \(\max_k(\hat{SR_k})\) together with the expected maximum Sharpe ratio implied by the theorem.
  • This result is profound: after roughly 1,000 independent backtests the expected maximum Sharpe ratio is about 3.2, even when the true Sharpe ratio is zero.
  • The reason is backtest overfitting: when selection bias (picking the best result) takes place under multiple testing (running many alternative configurations), the selected backtests are likely to be false discoveries.

A Solution

\[\widehat{DSR} \equiv \widehat{PSR}(\widehat{SR_0})=Z \left[\frac{(\hat{SR}-E[\max_k(\widehat{SR_k})])\sqrt{T-1}}{\sqrt{1-\hat{\gamma_3}\widehat{SR}+\frac{\hat{\gamma_4}-1}{4}\widehat{SR}^2}}\right]\]

  • \(\widehat{DSR}\) can be interpreted as the probability of observing a Sharpe ratio greater than or equal to \(\widehat{SR}\) under the null hypothesis that the true Sharpe ratio is zero, while adjusting for skewness \(\gamma_3\), kurtosis \(\gamma_4\), sample length, and multiple testing.

  • Calculating the DSR requires estimating \(E[\max_k(\widehat{SR_k})]\), which in turn requires estimates of \(K\) and \(V(\widehat{SR_k})\); this is where financial machine learning (FML) can help.

  • Specifically, we can employ clustering, choosing the optimal number of clusters, to estimate \(K\), the effective number of trials, and then calculate the variance of the Sharpe ratios across clusters.

Implications for Academics

  • Most studies in empirical finance are false (Harvey et al., 2016)
  • Selection bias may invalidate the entire body of work performed for the past 100 years
  • Finance cannot survive as a discipline unless we solve this problem
  • Investors and regulators have no reason to trust the value added by researchers and asset managers unless we learn to prevent false discoveries
  • Applying the False Strategy theorem to prevent false positives in finance requires estimating two meta-research variables to discount for “lucky findings”
  • Academic journals should cease accepting papers that do not control for selection bias under multiple testing
  • Papers must report the probability that the claimed financial discovery is a false positive

Implications for Regulators

  • Before the FDA, adulteration and mislabeling of food and drugs caused frequent episodes of mass poisoning, birth defects, and death
  • Financial firms engaging in backtest overfitting defraud investors for tens of billions of dollars annually
  • The SEC could demand quantitative firms certify the probability that promoted investments are bogus
  • Quantitative firms should be required to store all trials involved in a discovery for post-mortem analysis

Implications for Investors

  • Many financial firms promote pseudo-scientific products as scientific
  • Investment products based on award-winning journal articles are not necessarily scientific
  • If the original author has not become rich with the discovery, investors’ chances are slim
  • Investors should demand firms report the results of all trials, not only the best-looking ones
  • Investors should consult databases of investment forecasts and assess the credibility of gurus and financial firms based on all outcomes from past predictions

The Overfitting Spectrum

Low Risk of Overfitting

  • Few parameters (1-3)
  • Simple decision rules
  • Strong economic rationale
  • Consistent performance across markets
  • Limited configuration variations tested
  • Long out-of-sample periods

High Risk of Overfitting

  • Many parameters (10+)
  • Complex decision rules
  • No clear economic explanation
  • Performance varies across markets
  • Thousands of configurations tested
  • Short out-of-sample periods

The risk of false discovery increases exponentially with the number of configurations tested

Visualizing Precision and FDR

Critical Result: Even with a 5% significance level, when the odds ratio is low (common in finance), the false discovery rate can exceed 80%.

Practical Implementation of DSR

  1. Data Collection
    • Gather all trials/backtests performed
    • Record Sharpe ratios and their variances
    • Document the number of configurations tested
  2. Estimate Parameters
    • Calculate effective number of trials (K)
    • Estimate variance of Sharpe ratios
    • Compute skewness and kurtosis
  3. Calculate DSR
    • Apply the DSR formula
    • Interpret results in terms of probability

Best Practice: Document and store all trials, not just the successful ones.

Real-World Strategy Evaluation Framework

Performance Metrics

  • Sharpe Ratio
  • Sortino Ratio
  • Calmar Ratio
  • Maximum Drawdown
  • Win Rate

Robustness Checks

  • Out-of-sample testing
  • Monte Carlo simulations
  • Stress testing
  • Parameter stability

A comprehensive evaluation requires both performance metrics and robustness checks

Future Directions and Open Questions

  1. Methodological Challenges
    • Estimating true number of trials
    • Handling non-stationary markets
    • Accounting for transaction costs
  2. Regulatory Framework
    • Standardizing backtest reporting
    • Establishing minimum requirements
    • Creating audit trails
  3. Research Opportunities
    • Machine learning applications
    • Alternative performance metrics
    • Cross-market validation

Key Question: How can we develop more robust methods for strategy validation in an era of increasing data availability and computational power?

The Practice

From Theory to Practice

  • We’ve covered the theoretical foundations of backtest overfitting
  • Now we’ll focus on practical implementation
  • Key questions:
    • How do we calculate the Deflated Sharpe Ratio in practice?
    • How do we estimate the effective number of trials?
    • What practical workflows can prevent false discoveries?

A Practical DSR Workflow

  1. Strategy development and backtesting
  2. Estimation of effective number of trials (\(K\))
  3. Estimation of Sharpe ratio variance
  4. Calculation of expected maximum Sharpe ratio
  5. Calculation of Deflated Sharpe Ratio
  6. Evaluation against DSR threshold

Estimating the Effective Number of Trials

  • In practice, strategies are often highly correlated
  • The effective number of independent trials (\(K\)) is typically much lower than the total number of configurations tested
  • Methods to estimate \(K\):
    1. Clustering of strategy returns
    2. Principal Component Analysis
    3. Researcher’s logs of configuration tests

Clustering Approach to Effective Trials

# Generate correlated strategy returns
set.seed(42)
n_strategies <- 50
n_returns <- 252
base_returns <- matrix(rnorm(10 * n_returns), nrow = n_returns)

# Create strategies with varying correlations to base returns
strategies_returns <- matrix(0, nrow = n_returns, ncol = n_strategies)
for(i in 1:n_strategies) {
  # Mix of base returns and unique noise
  weight <- runif(1, 0.3, 0.9)
  base_idx <- sample(1:10, 1)
  strategies_returns[,i] <- weight * base_returns[,base_idx] + 
                           (1-weight) * rnorm(n_returns)
}

# Calculate correlation matrix
cor_matrix <- cor(strategies_returns)
# Convert to distance matrix
dist_matrix <- as.dist(1 - abs(cor_matrix))
# Hierarchical clustering
hc <- hclust(dist_matrix, method = "complete")

# Plot dendrogram
plot(hc, main = "Hierarchical Clustering of Strategy Returns", 
     xlab = "", sub = "", cex = 0.6)
rect.hclust(hc, k = 8, border = "red")

Visualizing Effective Number of Trials

# Convert strategies_returns to a data frame
strategies_df <- as.data.frame(strategies_returns)
colnames(strategies_df) <- paste0("Strategy", 1:ncol(strategies_df))

# Calculate optimal number of clusters using the silhouette method
# Note: We'll use the data frame directly instead of the distance matrix
library(factoextra)  # for fviz_nbclust() and hcut()
fviz_nbclust(strategies_df, FUN = hcut, method = "silhouette", 
             k.max = 15) +
  labs(title = "Optimal Number of Clusters",
       subtitle = "Using Silhouette Method")
# Cut tree at the optimal number of clusters
k_opt <- 8  # Based on silhouette plot
k_eff <- k_opt  # effective number of trials, used in the DSR calculation below
clusters <- cutree(hc, k = k_opt)

# Show the first few strategies and their cluster assignments
head(tibble(Strategy = 1:n_strategies, Cluster = clusters), 10)
# A tibble: 10 × 2
   Strategy Cluster
      <int>   <int>
 1        1       1
 2        2       2
 3        3       3
 4        4       4
 5        5       4
 6        6       5
 7        7       6
 8        8       6
 9        9       7
10       10       3

Calculating DSR with Estimated Effective Trials

# Function to calculate DSR
calculate_dsr <- function(strategy_returns, n_effective_trials, 
                          mean_sr = 0, sr_variance = NULL) {
  # Calculate Sharpe ratio and its components
  n <- length(strategy_returns)
  sr <- mean(strategy_returns) / sd(strategy_returns)
  
  # Calculate skewness and (raw) kurtosis; a Normal distribution has kurtosis = 3
  z <- (strategy_returns - mean(strategy_returns)) / sd(strategy_returns)
  skew <- sum(z^3) / n
  kurt <- sum(z^4) / n
  
  # If SR variance not provided, estimate it
  if (is.null(sr_variance)) {
    sr_variance <- 1  # Simplification for example
  }
  
  # Calculate expected maximum SR
  emc <- 0.577215664901532860606512090082402431042159336  # Euler-Mascheroni
  exp_max_sr <- (1 - emc) * qnorm(p = 1 - 1/n_effective_trials) + 
               emc * qnorm(1 - (n_effective_trials * exp(1))^(-1))
  exp_max_sr <- mean_sr + sqrt(sr_variance) * exp_max_sr
  
  # DSR calculation: PSR evaluated at the expected maximum SR, using (kurt - 1)/4 as in the formula
  numerator <- (sr - exp_max_sr) * sqrt(n - 1)
  denominator <- sqrt(1 - skew * sr + ((kurt - 1) / 4) * sr^2)
  dsr <- pnorm(numerator / denominator)
  
  return(list(
    sharpe_ratio = sr,
    expected_max_sr = exp_max_sr,
    dsr = dsr
  ))
}

# Calculate DSR for a sample strategy
sample_strategy <- strategies_returns[,1]
dsr_results <- calculate_dsr(
  sample_strategy, 
  n_effective_trials = k_eff
)

# Display results
cat("Strategy Sharpe Ratio:", round(dsr_results$sharpe_ratio, 4), "\n")
Strategy Sharpe Ratio: 0.0864 
cat("Expected Max SR with", k_eff, "trials:", 
    round(dsr_results$expected_max_sr, 4), "\n")
Expected Max SR with 8 trials: 1.459 
cat("Deflated Sharpe Ratio:", round(dsr_results$dsr, 4), "\n")
Deflated Sharpe Ratio: 0 

Visual Representation of Precision and FDR

# Function to calculate precision and FDR
calculate_precision_fdr <- function(theta, alpha = 0.05, beta = 0.2) {
  recall <- 1 - beta          
  b1 <- recall * theta
  precision <- b1 / (b1 + alpha)
  fdr <- 1 - precision
  return(c(precision = precision, fdr = fdr))
}

# Calculate precision and FDR for different theta values
theta_values <- seq(0.001, 0.5, by = 0.001)
results <- t(sapply(theta_values, calculate_precision_fdr))
results_df <- tibble(
  theta = theta_values,
  precision = results[, "precision"],
  fdr = results[, "fdr"]
)

# Plot
ggplot(results_df, aes(x = theta)) +
  geom_line(aes(y = precision, color = "Precision"), size = 1) +
  geom_line(aes(y = fdr, color = "FDR"), size = 1) +
  scale_color_manual(values = c("Precision" = "blue", "FDR" = "red")) +
  labs(
    title = "Precision and False Discovery Rate vs. Odds Ratio",
    subtitle = "Alpha = 0.05, Beta = 0.2",
    x = "Theta (Odds Ratio of True vs. False Strategies)",
    y = "Rate",
    color = "Metric"
  ) +
  theme_minimal() +
  geom_vline(xintercept = 0.05/0.8, linetype = "dashed") +
  annotate("text", x = 0.07, y = 0.5, 
           label = "Precision = 50%\nwhen θ = α/(1-β)")

Recent Advances: Combinatorial Purged Cross-Validation

  • Introduced by Lopez de Prado (2018)
  • Addresses two key problems in financial machine learning:
    1. Leakage from training to test sets due to serial correlation
    2. Selection bias under multiple testing

CPCV provides a framework for model selection that:

  • Purges training observations that overlap with test observations
  • Embargoes observations that follow test observations
  • Generates multiple train/test splits to assess model variance

Walk-Forward Testing vs. CPCV

Walk-Forward Testing

  • Traditional approach in finance
  • Training window followed by test window
  • Windows move forward in time
  • Limited number of test samples
  • Does not fully address selection bias

Combinatorial Purged CV

  • Training and test sets across all available data
  • Purging of overlapping observations
  • Embargo of subsequent observations
  • Many more test samples
  • Better estimate of out-of-sample performance
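For contrast with the CPCV pseudocode in the next section, a minimal walk-forward (rolling-origin) splitter might look like the base-R sketch below; the window lengths are illustrative assumptions:

# Minimal sketch: walk-forward (rolling-origin) train/test splits
walk_forward_splits <- function(n_obs, train_window = 500, test_window = 100) {
  splits <- list()
  start  <- 1
  while (start + train_window + test_window - 1 <= n_obs) {
    train_idx <- start:(start + train_window - 1)
    test_idx  <- (start + train_window):(start + train_window + test_window - 1)
    splits[[length(splits) + 1]] <- list(train = train_idx, test = test_idx)
    start <- start + test_window   # roll the origin forward by one test window
  }
  splits
}

# Example: 1,260 observations (about five years of daily data) yield a handful of splits
length(walk_forward_splits(1260))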

Practical Implementation of CPCV

# This is pseudocode for demonstration purposes
implement_cpcv <- function(returns, feature_data, model_func, 
                           n_splits = 5, purge_window = 20, embargo = 5) {
  
  # Define time indices
  T <- length(returns)
  indices <- 1:T
  
  # Create time-based folds
  fold_size <- floor(T / n_splits)
  folds <- list()
  
  for(i in 1:n_splits) {
    test_start <- (i-1) * fold_size + 1
    test_end <- min(i * fold_size, T)
    test_indices <- test_start:test_end
    
    # Apply purging: remove from training observations that overlap with test
    purge_before <- max(1, test_start - purge_window)
    purge_after <- min(T, test_end + purge_window)
    purge_indices <- purge_before:purge_after
    
    # Apply embargo: remove from training the observations that immediately follow the test set
    embargo_end <- min(T, test_end + embargo)
    embargo_indices <- if (test_end < T) (test_end + 1):embargo_end else integer(0)
    
    # Training indices are all indices except test, purge, and embargo
    train_indices <- setdiff(indices, unique(c(test_indices, purge_indices, embargo_indices)))
    
    folds[[i]] <- list(train = train_indices, test = test_indices)
  }
  
  # Run model on each fold
  results <- list()
  for(i in 1:length(folds)) {
    train_data <- feature_data[folds[[i]]$train, ]
    train_returns <- returns[folds[[i]]$train]
    test_data <- feature_data[folds[[i]]$test, ]
    test_returns <- returns[folds[[i]]$test]
    
    # Train model
    model <- model_func(train_data, train_returns)
    
    # Predict on test data
    predictions <- predict(model, test_data)
    
    # Evaluate performance
    performance <- evaluate_performance(predictions, test_returns)
    
    results[[i]] <- performance
  }
  
  return(results)
}

Making Investment Decisions with DSR

DSR Thresholds for Strategy Selection

  • DSR < 0.5: Likely false discovery (reject)
  • 0.5 ≤ DSR < 0.95: Possible true discovery (further testing)
  • DSR ≥ 0.95: Likely true discovery (accept)
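A tiny helper encoding the thresholds above (the cut-offs are those listed; the function name is ours):

# Minimal sketch: map a Deflated Sharpe Ratio to the decision rule above
dsr_decision <- function(dsr) {
  ifelse(dsr < 0.5,  "Reject (likely false discovery)",
  ifelse(dsr < 0.95, "Further testing (possible true discovery)",
                     "Accept (likely true discovery)"))
}

dsr_decision(c(0.25, 0.55, 0.65, 0.98))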

Implementation Considerations

  • Monitor DSR through time
  • Re-evaluate strategies when DSR drops
  • Allocate capital based on DSR confidence
  • Diversify across uncorrelated strategies
Strategy Selection Framework

| Strategy   | Sharpe | DSR  | Decision        |
|------------|--------|------|-----------------|
| Strategy A | 1.80   | 0.25 | Reject          |
| Strategy B | 2.20   | 0.55 | Further Testing |
| Strategy C | 2.00   | 0.65 | Further Testing |
| Strategy D | 1.95   | 0.98 | Accept          |

Meta-Labeling: A Complementary Approach

Meta-labeling separates the problem of side prediction (buy/sell) from the problem of bet sizing.

  • First Model: Predicts the direction (e.g., using technical indicators)
  • Second Model: Predicts the probability of success for the first model’s predictions

Benefits

  • Addresses class imbalance problem
  • Reduces false positives
  • Provides natural bet sizing
  • Complements DSR framework

Implementation

  1. Develop primary model for direction
  2. Label outcomes (success/failure)
  3. Train secondary model to predict success
  4. Size positions based on success probability
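A stylised sketch of this two-stage workflow on simulated data; the primary model (a naive momentum sign rule) and the secondary model (a logistic regression) are illustrative assumptions, not a recommended specification:

# Minimal sketch: meta-labeling with a naive primary model and a logistic secondary model
set.seed(5)
n   <- 1000
ret <- rnorm(n, 0, 0.01)                                              # simulated daily returns
mom <- as.numeric(stats::filter(ret, rep(1, 20), sides = 1))          # trailing 20-day momentum
vol <- as.numeric(stats::filter(abs(ret), rep(1/20, 20), sides = 1))  # trailing volatility proxy

# 1) Primary model: predict the side (long/short) from momentum
side <- sign(mom); side[side == 0 | is.na(side)] <- 1

# 2) Meta-label: did the primary model's bet make money over the next day?
fwd_ret <- c(ret[-1], NA)
success <- as.integer(side * fwd_ret > 0)

# 3) Secondary model: predict the probability that the primary bet succeeds
df   <- na.omit(data.frame(success, mom, vol))
meta <- glm(success ~ mom + vol, data = df, family = binomial())

# 4) Bet sizing: scale position size by the predicted success probability
bet_size <- predict(meta, type = "response")
summary(bet_size)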

Bayesian Approaches to Backtest Evaluation

  • Traditional backtesting (frequentist) is vulnerable to overfitting
  • Bayesian methods offer advantages:
    • Incorporation of prior beliefs
    • Full posterior distributions instead of point estimates
    • Natural handling of model uncertainty

Bayesian Sharpe Ratio

  • Assumes a prior distribution for the Sharpe ratio
  • Updates based on observed returns
  • Results in posterior distribution
  • Provides probability intervals for true SR
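A minimal simulation sketch of a Bayesian Sharpe ratio under a standard noninformative prior for Normal returns; the prior and the simulated data are assumptions made purely for illustration:

# Minimal sketch: posterior distribution of the Sharpe ratio under a noninformative prior
# Model: r_t ~ N(mu, sigma^2); prior p(mu, sigma^2) proportional to 1/sigma^2
set.seed(6)
r <- rnorm(252, mean = 0.0005, sd = 0.01)   # one year of simulated daily returns
n <- length(r); xbar <- mean(r); s2 <- var(r)

n_draws <- 10000
sigma2  <- (n - 1) * s2 / rchisq(n_draws, df = n - 1)        # posterior draws of sigma^2
mu      <- rnorm(n_draws, mean = xbar, sd = sqrt(sigma2/n))  # posterior draws of mu
sr_post <- mu / sqrt(sigma2)                                 # implied posterior SR draws

quantile(sr_post, c(0.05, 0.5, 0.95))   # posterior interval for the (daily) Sharpe ratio
mean(sr_post > 0)                       # posterior probability that the true SR is positive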

Complexity-Adjusted Performance Metrics

  • Strategy complexity is a key factor in overfitting
  • More complex strategies have more degrees of freedom
  • Complexity-adjusted metrics penalize complexity:

Information-Theoretic Approaches

  • Akaike Information Criterion (AIC)
  • Bayesian Information Criterion (BIC)
  • Minimum Description Length (MDL)

Regularization Techniques

  • Ridge regression (L2 penalty)
  • LASSO regression (L1 penalty)
  • Elastic Net (combination of L1 and L2)
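As a brief illustration of the regularisation idea, the sketch below uses the glmnet package (assumed to be installed) to fit a LASSO that shrinks the coefficients of most noisy candidate signals to zero; the data are simulated:

# Minimal sketch: LASSO to select among many candidate signals (most are pure noise)
library(glmnet)
set.seed(7)
n <- 500; p <- 50
X <- matrix(rnorm(n * p), n, p)              # 50 candidate signals
y <- 0.5 * X[, 1] - 0.3 * X[, 2] + rnorm(n)  # only two signals truly matter

cv_fit   <- cv.glmnet(X, y, alpha = 1)       # alpha = 1 -> L1 (LASSO) penalty
selected <- which(as.numeric(coef(cv_fit, s = "lambda.min"))[-1] != 0)
selected                                     # indices of the signals retained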

Example: AIC for Strategy Selection

Complexity-Adjusted Strategy Performance

| Strategy                 | Parameters | Sharpe Ratio | AIC | Adjusted Sharpe |
|--------------------------|------------|--------------|-----|-----------------|
| Moving Average Crossover | 2          | 1.2          | 120 | 1.130685        |
| Bollinger Band Strategy  | 5          | 1.5          | 150 | 1.339056        |
| Multi-factor Model       | 12         | 1.8          | 210 | 1.551509        |
| Deep Neural Network      | 150        | 2.1          | 350 | 1.598936        |

Ethical Considerations in Strategy Development

  • Researchers have ethical responsibility to report honest results
  • Investors trust performance metrics for capital allocation
  • Regulators rely on accurate disclosures

Best Practices:

  1. Pre-register testing protocols
  2. Report all trials and configurations
  3. Disclose DSR alongside Sharpe ratio
  4. Maintain research logs for audit trail
  5. Use out-of-sample validation periods

Industry Implementation: Case Study

AQR Capital Management: One of the pioneers in addressing backtest overfitting

Cliff Asness (AQR co-founder): “We aim to publish strategies with high out-of-sample Sharpe ratios, not just high backtest Sharpe ratios.”

AQR’s Approach:

  • Long out-of-sample periods (often decades)
  • Focus on economically justified factors
  • Implementation across multiple asset classes
  • Transparency in methodology
  • Publication of research and results

Putting It All Together: A Robust Workflow

Summary of Key Extensions

  • Practical implementation of DSR calculation
  • Methods to estimate effective number of trials
  • Combinatorial Purged Cross-Validation
  • Meta-labeling for bet sizing
  • Bayesian approaches to backtest evaluation
  • Complexity-adjusted performance metrics
  • Ethical considerations and best practices

Workshop Exercises

  1. False Discovery Estimation

    • Simulate strategy development and selection
    • Calculate false discovery rates
    • Implement DSR to correct for selection bias
  2. Robust Strategy Evaluation Framework

    • Develop walk-forward testing procedure
    • Estimate effective number of trials
    • Create decision framework for strategy selection

References

  • Lopez de Prado, M. (2018). “Advances in Financial Machine Learning.” John Wiley & Sons.
  • Lopez de Prado, M. (2020). “Machine Learning for Asset Managers.” Cambridge University Press.
  • Lopez de Prado, M. (2019). “A data science solution to the multiple-testing crisis in financial research.” Journal of Financial Data Science.
  • Bailey, D. H., & Lopez de Prado, M. (2014). “The deflated Sharpe ratio: Correcting for selection bias, backtest overfitting, and non-normality.” Journal of Portfolio Management.
  • Harvey, C. R., Liu, Y., & Zhu, H. (2016). “… and the Cross-Section of Expected Returns.” Review of Financial Studies.
  • Harvey, C. R., & Liu, Y. (2015). “Backtesting.” Journal of Portfolio Management.
  • Cherry, S., & Shallue, C. J. (2019). “Statistical significance and p-values in machine learning research.” ArXiv preprint.
  • Bollen, N. P. B., & Pool, V. K. (2009). “Do Hedge Fund Managers Misreport Returns? Evidence from the Pooled Distribution.” Journal of Finance.
  • Novy-Marx, R. (2016). “Testing strategies based on multiple signals.” Working paper, University of Rochester.
  • Gu, Shihao, Bryan Kelly, and Dacheng Xiu. 2020. “Empirical Asset Pricing via Machine Learning.” The Review of Financial Studies.
  • Harvey, Campbell R., Presidential Address: The Scientific Outlook in Financial Economics. 2017.
  • American Statistical Association. 2016. “Ethical guidelines for statistical practice.”
  • López de Prado, M. and M. Lewis. 2018. “Detection of False Investment Strategies Using Unsupervised Learning Methods.”
  • Bailey, D., J. Borwein, M. López de Prado, and J. Zhu. 2014. “Pseudo-mathematics and financial charlatanism: The effects of backtest overfitting on out-of-sample performance.”
  • Bailey, D., J. Borwein, M. López de Prado, and J. Zhu. 2017. “The Probability of Backtest Overfitting.”
  • Bailey, D. and M. López de Prado. 2012. “The Sharpe ratio efficient frontier.”